Quality of Red Wine by TU YEMEI

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Univariate Plots Section

First, here is an overview of the distribution of wine quality:

The distribution looks odd even with the binwidth set to 0.5. A likely reason is that quality is recorded as whole-number scores (3 to 8 in this dataset), so half-unit bins leave gaps between the bars.
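The effect of the binwidth on whole-number scores can be sketched in base R. This is only an illustration: the `quality` vector below is a made-up stand-in for `wine$quality`, and its proportions are invented, not taken from the data.

```r
# Hypothetical stand-in for wine$quality: whole-number scores from 3 to 8
# drawn with made-up proportions (not the real ones).
set.seed(1)
quality <- sample(3:8, size = 200, replace = TRUE,
                  prob = c(0.01, 0.03, 0.43, 0.40, 0.12, 0.01))

# Half-unit bins: no score falls strictly between two integers,
# so every other bin stays empty and the histogram shows gaps.
half_bins <- hist(quality, breaks = seq(2.5, 8.5, by = 0.5), plot = FALSE)

# Unit-width bins centred on each integer: exactly one bin per score.
unit_bins <- hist(quality, breaks = seq(2.5, 8.5, by = 1), plot = FALSE)

print(half_bins$counts)  # alternating counts and zeros
print(unit_bins$counts)  # one count per quality score
```

With ggplot2, `geom_bar()` or `geom_histogram(binwidth = 1)` should have a similar effect on an integer-valued variable.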
Next, I look at the distributions of the other chemical properties.

I make some adjustments to the plots of the chemical properties.

Free sulfur dioxide and total sulfur dioxide are both positively skewed.

Univariate Analysis

What is the structure of your dataset?

The dataset has 13 variables (including the row index X) and 1,599 observations.

What is/are the main feature(s) of interest in your dataset?

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think the three acid variables and the two sulfur dioxide variables could each be combined into a single variable and examined together with the other chemical properties.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Some variables have long tails, so I limited the x-axis to the 95th percentile in order to exclude some outliers.
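A minimal sketch of this kind of percentile cut-off, using base R. The vector `x` is simulated right-skewed data standing in for a column such as `wine$chlorides`; it is an assumption for illustration, not the real data.

```r
set.seed(42)
# Hypothetical right-skewed values standing in for a long-tailed column.
x <- rlnorm(1000, meanlog = -2.5, sdlog = 0.4)

# The 95th percentile serves as the upper limit of the x-axis.
upper <- quantile(x, probs = 0.95)

# Keep only the values at or below that limit.
trimmed <- x[x <= upper]

print(upper)
print(length(trimmed))
```

In a ggplot2 plot, the same limit could be passed to `xlim(0, upper)`, which drops the points beyond it, or to `coord_cartesian(xlim = c(0, upper))`, which only zooms the view.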

Bivariate Plots Section

I add lines to show the general trend.

In this plot, we can observe that alcohol increases with quality.

Labeling quality with different levels & boxplot

I want to group wine quality into several categories so that I can explore it with box plots.

Make the same analysis for the other 3 factors and explore the plot

There is a negative relationship between quality and volatile acidity.

positive relationship

It is clear that higher-quality wines have a higher alcohol content.
Since there are six levels, and the highest and lowest levels contain too few wines to be seen clearly, I combine adjacent levels in pairs and divide the six levels into three wider levels: Low, Median, High.

In this way, we get a clearer overview of the alcohol distribution, and we can apply the same analysis to the other three factors.
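The grouping into three wider levels can be sketched with `cut()`. The break points below assume the pairs are 3-4, 5-6, and 7-8, and the short `quality` vector is illustrative only.

```r
# Illustrative quality scores covering all six levels (3 to 8).
quality <- c(3, 4, 5, 6, 7, 8, 5, 6)

# Assumed pairing: (2,4] -> Low, (4,6] -> Median, (6,8] -> High.
quality_level <- cut(quality,
                     breaks = c(2, 4, 6, 8),
                     labels = c("Low", "Median", "High"),
                     ordered_result = TRUE)

print(table(quality_level))
```

An ordered factor keeps the levels in Low, Median, High order on the axis of a box plot.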

A density plot works better than a histogram for volatile acidity: from the upper plot we can only conclude that most wines lie around 0.5 on the x-axis, whereas the lower plot shows that higher quality goes with lower volatile acidity.

Higher quality is associated with higher citric acid.

This plot has a long tail, so I omit the top 1% of sulphates values.

Explore two main variables

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
First, I calculated the Pearson correlation between the variables, focusing especially on their relationships with quality. I then explored the four factors most strongly correlated with quality. It turns out that volatile acidity has a negative influence on quality, while citric acid and alcohol have a positive influence.
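The ranking step can be sketched as follows. The `toy` data frame is simulated with a made-up relationship purely so the example runs on its own; with the real data, the same calls would be applied to the `wine` data frame.

```r
set.seed(7)
n <- 500
alcohol          <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
residual.sugar   <- rnorm(n, mean = 2.5, sd = 1)

# Made-up relationship, used only to give the example some signal.
quality <- 5.6 + 0.4 * (alcohol - 10.4) -
  1.5 * (volatile.acidity - 0.53) + rnorm(n, sd = 0.6)
toy <- data.frame(alcohol, volatile.acidity, residual.sugar, quality)

# Pearson correlation of every predictor with quality.
predictors <- setdiff(names(toy), "quality")
r <- sapply(toy[predictors],
            function(col) cor(col, toy$quality, method = "pearson"))

# Rank by absolute strength, strongest first.
ranked <- r[order(abs(r), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting by absolute value keeps strong negative correlations, such as the one for volatile acidity, near the top of the list.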
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
I plotted the chemical properties that turned out to have a weak correlation with quality, and I found that three variables (fixed acidity, free sulfur dioxide, and total sulfur dioxide) show some peaks.
What was the strongest relationship you found?

Multivariate Plots Section

We already know that alcohol contributes a lot to wine quality. Now I want to add other variables to see whether they contribute to quality in other ways.

Taking level 3 in the plot above as an example, density ranges from 0.9925 to 1.0000, which suggests that density contributes little to the quality level. The plot also suggests that alcohol has a weak negative relationship with density.

Because the combined plot is hard to read, I use the facet_wrap function to split it into panels. Low-quality wines (levels 1-3) lie around 0.5 on the y-axis, while wines at quality levels 4-6 tend to sit higher on both the y-axis and the x-axis, so higher-quality wines have higher alcohol and sulphates.

The influence of pH is not obvious.

Every quality level spans a wide range on the y-axis, which suggests that residual sugar has little influence on quality.

Lower total sulfur dioxide and higher alcohol are associated with higher quality.

No obvious correlation

Linear analysis:

Predict the wine quality based on chemical properties
## 
## Calls:
## m1: lm(formula = quality ~ volatile.acidity, data = wine)
## m2: lm(formula = quality ~ volatile.acidity + alcohol, data = wine)
## m3: lm(formula = quality ~ volatile.acidity + alcohol + sulphates, 
##     data = wine)
## m4: lm(formula = quality ~ volatile.acidity + alcohol + sulphates + 
##     citric.acid, data = wine)
## m5: lm(formula = quality ~ volatile.acidity + alcohol + sulphates + 
##     citric.acid + chlorides, data = wine)
## m6: lm(formula = quality ~ volatile.acidity + alcohol + sulphates + 
##     citric.acid + chlorides + total.sulfur.dioxide, data = wine)
## m7: lm(formula = quality ~ volatile.acidity + alcohol + sulphates + 
##     citric.acid + chlorides + total.sulfur.dioxide + density, 
##     data = wine)
## 
## ==========================================================================================================================
##                              m1            m2            m3            m4            m5            m6            m7       
## --------------------------------------------------------------------------------------------------------------------------
##   (Intercept)               6.566***      3.095***      2.611***      2.646***      2.769***      2.985***     -0.953     
##                            (0.058)       (0.184)       (0.196)       (0.201)       (0.202)       (0.206)      (11.990)    
##   volatile.acidity         -1.761***     -1.384***     -1.221***     -1.265***     -1.155***     -1.104***     -1.114***  
##                            (0.104)       (0.095)       (0.097)       (0.113)       (0.115)       (0.115)       (0.120)    
##   alcohol                                 0.314***      0.309***      0.309***      0.292***      0.276***      0.280***  
##                                          (0.016)       (0.016)       (0.016)       (0.016)       (0.017)       (0.020)    
##   sulphates                                             0.679***      0.696***      0.871***      0.908***      0.903***  
##                                                        (0.101)       (0.103)       (0.111)       (0.111)       (0.112)    
##   citric.acid                                                        -0.079         0.021         0.065         0.044     
##                                                                      (0.104)       (0.106)       (0.106)       (0.124)    
##   chlorides                                                                        -1.663***     -1.763***     -1.747***  
##                                                                                    (0.405)       (0.403)       (0.406)    
##   total.sulfur.dioxide                                                                           -0.002***     -0.002***  
##                                                                                                  (0.001)       (0.001)    
##   density                                                                                                       3.923     
##                                                                                                               (11.944)    
## --------------------------------------------------------------------------------------------------------------------------
##   R-squared                 0.153         0.317         0.336         0.336         0.343         0.352         0.352     
##   adj. R-squared            0.152         0.316         0.335         0.334         0.341         0.349         0.349     
##   sigma                     0.744         0.668         0.659         0.659         0.656         0.651         0.652     
##   F                       287.444       370.379       268.912       201.777       166.407       143.910       123.298     
##   p                         0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -1794.312     -1621.814     -1599.384     -1599.093     -1590.662     -1580.192     -1580.138     
##   Deviance                883.198       711.796       692.105       691.852       684.595       675.689       675.643     
##   AIC                    3594.624      3251.628      3208.768      3210.186      3195.324      3176.384      3178.276     
##   BIC                    3610.756      3273.136      3235.654      3242.448      3232.964      3219.401      3226.670     
##   N                      1599          1599          1599          1599          1599          1599          1599         
## ==========================================================================================================================
The model (m6) can be described as: wine_quality = 2.985 + 0.276 * alcohol - 1.104 * volatile.acidity + 0.908 * sulphates + 0.065 * citric.acid - 1.763 * chlorides - 0.002 * total.sulfur.dioxide
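As a worked example, the m6 coefficients from the regression table can be applied to a wine whose properties all sit at the median values reported by the summary at the top of this report. Such a "median wine" is a hypothetical illustration, not a row of the dataset.

```r
# Coefficients copied from the m6 column of the regression table.
coefs <- c(
  "(Intercept)"          =  2.985,
  "volatile.acidity"     = -1.104,
  "alcohol"              =  0.276,
  "sulphates"            =  0.908,
  "citric.acid"          =  0.065,
  "chlorides"            = -1.763,
  "total.sulfur.dioxide" = -0.002
)

# Median values from the summary output; the intercept is multiplied by 1.
median_wine <- c(
  "(Intercept)"          = 1,
  "volatile.acidity"     = 0.52,
  "alcohol"              = 10.2,
  "sulphates"            = 0.62,
  "citric.acid"          = 0.26,
  "chlorides"            = 0.079,
  "total.sulfur.dioxide" = 38
)

# Predicted quality is the sum of coefficient * value over all terms.
predicted <- sum(coefs * median_wine[names(coefs)])
print(round(predicted, 2))  # 5.59
```

The result, about 5.59, is close to the mean quality of 5.636 reported in the summary.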
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
Density contributes little to the quality level, but alcohol has a negative relationship with density. Higher-quality wines have higher alcohol and sulphates. The influence of pH is not obvious. Residual sugar has little influence on quality. Lower total sulfur dioxide is associated with higher quality.
Were there any interesting or surprising interactions between features?
In the bivariate plots, some variables, such as residual sugar, show peaks when plotted against quality. However, once other variables are added to the analysis, the results show that residual sugar has little influence on quality.
OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
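The model (m6) can be described as: wine_quality = 2.985 + 0.276 * alcohol - 1.104 * volatile.acidity + 0.908 * sulphates + 0.065 * citric.acid - 1.763 * chlorides - 0.002 * total.sulfur.dioxide
This model has an R-squared of 0.352, so it explains only about 35% of the variance in quality.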
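In the regression table, AIC and BIC are lowest for m6 and rise again for m7, which is one way to judge when an added variable stops helping. The comparison can be sketched on simulated data. The `toy` data frame and its coefficients are invented for illustration, and `noise_var` is a hypothetical predictor that is unrelated to quality by construction.

```r
set.seed(3)
n <- 300
alcohol          <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
noise_var        <- rnorm(n)  # unrelated to quality by construction

# Made-up relationship, used only so the models have something to fit.
quality <- 5.6 + 0.3 * (alcohol - 10.4) -
  1.2 * (volatile.acidity - 0.53) + rnorm(n, sd = 0.65)
toy <- data.frame(quality, alcohol, volatile.acidity, noise_var)

# Nested models, each adding one predictor to the previous one.
m1 <- lm(quality ~ volatile.acidity, data = toy)
m2 <- update(m1, . ~ . + alcohol)
m3 <- update(m2, . ~ . + noise_var)

models <- list(m1, m2, m3)
comparison <- data.frame(
  model  = c("m1", "m2", "m3"),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC    = sapply(models, AIC),
  BIC    = sapply(models, BIC)
)
print(comparison)
```

R-squared can never decrease when a predictor is added to a nested model, so adjusted R-squared, AIC, and BIC are the more informative columns when deciding which model to keep.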

Final Plots and Summary

Plot One

Description One

Most of the data lies in quality levels 5-7, and alcohol has a clear positive relationship with quality: the better the quality, the higher the alcohol percentage. The line clearly shows this trend. However, in the linear model the alcohol coefficient is only about 0.276, so alcohol plays an important role but is not the only factor determining wine quality.

Plot Two

Description Two

Plot Three

Description Three

In general, high-quality wines tend to have higher alcohol and lower volatile acidity content. They also tend to have higher sulphates and higher citric acid content.

Reflection

The red wine dataset contains 1,599 observations with 11 variables describing chemical properties. I focused on the correlation between the chemical properties and quality and explored which variables have the most influence on quality. In the multivariate analysis, I also examined the correlations among the chemical properties themselves, not only with quality. Finally, I built linear models to quantify these influences.
However, the other chemical properties show only weak correlation with quality, both in the visualizations and in the statistical calculations. Wine quality is a complex problem influenced by many factors, so the linear model I used is an oversimplification.
In my opinion, the variables in this dataset are not well suited to this analysis, since only 4 factors showed a correlation with quality. I propose adding more informative variables for further analysis, such as place of production, temperature, water percentage, environment, and year. In addition, most wines in this dataset have a quality level of 5-6, and low-quality and high-quality wines are scarce, so more data at those levels would be needed for a deeper analysis.